Under the hood
##############

This chapter gives insight into the inner workings of the Genode OS
framework. In particular, it explains how the concepts presented in Chapter
[Architecture] are realized on different kernels and hardware platforms.


Component-local startup code and linker scripts
===============================================

All Genode components, including core, rely on the same startup code, which
is roughly outlined at the end of Section [Component creation]. This
section revisits the required steps in more detail and refers to the corresponding
points in the source code. Furthermore, it provides background information
about the linkage of components, which is closely related to the startup
code.


Linker scripts
~~~~~~~~~~~~~~

Under the hood, the Genode build system uses three different linker scripts
located at _repos/base/src/ld/_:

:genode.ld: is used for statically linked components, including core,

:genode_dyn.ld: is used for dynamically linked components, i.e., components
  that are linked against at least one shared library,

:genode_rel.ld: is used for shared libraries.

Additionally, there exists a special linker script for the dynamic linker
(Section [Dynamic linker]).

Each program image generated by the linker generally consists of three parts,
which appear consecutively in the component's virtual memory.

# A read-only "text" part contains sections for code, read-only
  data, and the list of global constructors and destructors.

  The startup code is placed in a dedicated section '.text.crt0', which
  appears right at the start of the segment. Thereby the link address of
  the component is known to correspond to the ELF entrypoint (the first
  instruction of the assembly startup code).
  This is useful when converting the ELF image of the base-hw version of
  core into a raw binary. Such a raw binary can be loaded directly into
  the memory of the target platform without the need for an ELF loader.

  The mechanisms for generating the list of constructors and destructors
  differ between CPU architectures and are defined by the architecture's
  ABI. On x86, the lists are represented by '.ctors.*' and '.dtors.*'.
  On ARM, the information about global constructors is represented by
  '.init_array' and there is no visible information about global destructors.

# A read-writable "data" part that is pre-populated with data.

# A read-writable "bss" part that is not physically present in the binary but
  known to be zero-initialized when the ELF image is loaded.

The link address is not defined in the linker script but specified as
linker argument. The default link address is specified in a platform-specific
spec file, e.g., _repos/base-nova/mk/spec/nova.mk_ for the NOVA platform.
Components that need to organize their virtual address space in a special
way (e.g., a virtual machine monitor that co-locates the guest-physical
address space with its virtual address space) may specify link addresses
that differ from the default one by overriding the LD_TEXT_ADDR value.


ELF entry point
---------------

As defined at the start of the linker script via the ENTRY directive, the
ELF entrypoint is the function '_start'. This function is located at the very
beginning of the '.text.crt0' section. See Section [Startup code] for
more details.


Symbols defined by the linker script
------------------------------------

The following symbols are defined by the linker script and used by the
base framework.

:'_prog_img_beg, _prog_img_data, _prog_img_end':
  Those symbols mark the start of the "text" part, the start of the "data"
  part (the end of the "text" part), and the end of the "bss" part.
  They are used by core to exclude those virtual memory ranges from
  the core's virtual-memory allocator (core-region allocator).

:'_parent_cap, _parent_cap_thread_id, _parent_cap_local_name':
  Those symbols are located at the beginning of the "data" part.
  During the ELF loading of a new component, the parent writes
  information about the parent capability to this location (the start
  of the first read-writable ELF segment). See the corresponding code
  in the 'Loaded_executable' constructor in _repos/base/src/lib/base/child_process.cc_.
  The use of the information depends on the base platform. E.g.,
  on a platform where a capability is represented by a tuple of a global
  thread ID and an object ID such as OKL4 and L4ka::Pistachio, the
  information is taken as verbatim values. On platforms that fully
  support capability-based security without the use of any form of
  a global name to represent a capability, the information remains unused.
  Here, the parent capability is represented by the same known
  local name in all components.
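
The way such linker-defined symbols are consumed from C++ can be sketched
with the analogous symbols that GNU ld's default scripts provide on Linux
('etext', 'edata', 'end'). This is an analogy only; Genode's '_prog_img_*'
symbols exist when linking against _genode.ld_, not in a plain hosted build.

```cpp
// Linker-defined symbols are declared as ordinary objects, but only
// their addresses are meaningful - the "values" are the bytes that
// happen to reside at those addresses.
extern char etext;   // end of the "text" part
extern char edata;   // end of the initialized "data" part
extern char end;     // end of the "bss" part

// The addresses follow the layout of the program image in memory.
bool segments_are_ordered()
{
    return &etext <= &edata && &edata <= &end;
}
```

Code like core's region-allocator setup takes the addresses of such symbols
to compute the boundaries of the program image.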

Even though the linker scripts are used across all base platforms, they
contain a few platform-specific supplements that are needed to support
the respective kernel ABIs. For example, the definition of the symbol
'__l4sys_invoke_indirect' is needed only on the Fiasco.OC platform and
is unused on the other base platforms. Please refer to the comments
in the linker script for further explanations.


Startup code
~~~~~~~~~~~~

The execution of the initial thread of a new component starts at the ELF
entry point, which corresponds to the '_start' function. This is an
assembly function defined in _repos/base/src/lib/startup/spec/<arch>/crt0.s_
where _<arch>_ is the CPU architecture (x86_32, x86_64, or ARM).


Assembly startup code
---------------------

The assembly startup code is position-independent code (PIC).
Because the Genode base libraries are linked into both statically
and dynamically linked executables, they have to be compiled as PIC code.
To be consistent with the base libraries, the startup code needs to be
position-independent, too.

The code performs the following steps:

# Saving the initial state of certain CPU registers. Depending on the
  kernel used, these registers carry information from the
  kernel to the core component. More details about this information
  are provided by Section [Bootstrapping and allocator setup]. The
  initial register values are saved in global variables named
  '_initial_<register>'. The global variables are located in the BSS
  segment. Note that those variables are used solely by core.

# Setting up the initial stack. Before the assembly code can call any
  higher-level C function, the stack pointer must be initialized to
  point to the top of a valid stack. The initial stack is located in the
  BSS section and referred to by the symbol '_stack_high'. However,
  having a stack located within the BSS section is dangerous. If it
  overflows (e.g., by declaring large local variables, or by recursive
  function calls), the stack would silently overwrite parts of the
  BSS and DATA sections located below the lower stack boundary. For code
  that is known in advance, the stack can be dimensioned to a reasonable
  size. But for arbitrary application code, no assumption about
  the stack usage can be made. For this reason, the initial stack cannot
  be used for the entire lifetime of the component. Before any
  component-specific code is called, the stack needs to be relocated to
  another area of the virtual address space where the lower bound of the
  stack is guarded by empty pages. When using such a "real" stack, a
  stack overflow will produce a page fault, which can be handled or at least
  immediately detected. The initial stack is solely used to perform the
  steps required to set up the real stack. Because those steps are the same for
  all components, the usage of the initial stack is bounded.

# Because the startup code is used by statically linked components as well as
  the dynamic linker, the startup code immediately calls the 'init_rtld' hook
  function.
  For regular components, the function does not do anything. The default
  implementation in _init_main_thread.cc_ at _repos/base/src/lib/startup/_ is a weak
  function. The dynamic linker provides a non-weak implementation, which
  allows the linker to perform initial relocations of itself very early at
  the dynamic linker's startup.

# By calling the 'init_main_thread' function defined in
  _repos/base/src/lib/startup/init_main_thread.cc_, the assembly code triggers
  the execution of all the steps needed for the creation of the real stack.
  The function is implemented in C++, uses the initial stack, and returns
  the address of the real stack.

# With the new stack pointer returned by 'init_main_thread', the assembly
  startup code is able to switch the stack pointer from the initial stack to
  the real stack. From this point on, stack overflows cannot easily corrupt
  any data.

# With the real stack in place, the assembly code finally passes the control
  over to the C++ startup code provided by the '_main' function.
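
The 'init_rtld' hook of step 3 relies on the linker's symbol-overriding
rules for weak symbols. A standalone sketch of the pattern, with
illustrative names rather than Genode's actual symbols:

```cpp
// Default implementation, marked weak: it is used only if no other
// object file contributes a non-weak definition of the same symbol.
extern "C" __attribute__((weak)) int init_hook()
{
    return 0;   // regular components: do nothing
}

// The startup path calls the hook unconditionally. A binary like the
// dynamic linker would provide a strong 'init_hook' that wins at link
// time and performs its early self-relocation there.
int run_startup()
{
    return init_hook();
}
```

Because the weak default is a complete no-op, regular components pay no
cost for the hook.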


Initialization of the real stack along with the Genode environment
------------------------------------------------------------------

As mentioned above, the assembly code calls the 'init_main_thread' function
(located in _repos/base/src/lib/startup/init_main_thread.cc_) for setting up the
real stack for the program. For placing a stack in a dedicated portion of the
component's virtual address space, the function needs to overcome two
principal problems:

* It needs to obtain the backing store used for the stack, i.e.,
  allocating a dataspace from the component's PD session as initialized
  by the parent.

* It needs to preserve a portion of its virtual address space for placing
  the stack and make the allocated memory visible within this portion.

In order to solve both problems, the function needs to obtain the capability
for its PD session from its parent. This comes down to
the need to perform RPC calls. First, for requesting the PD
session capability from the parent, and second, for invoking the session
capability to perform the RAM allocation and region-map attach operations.

The RPC mechanism is based on C++. In particular, the mechanism supports
the propagation of C++ exceptions across RPC interfaces. Hence,
before being able to perform RPC calls, the program must initialize
the C++ runtime including the exception-handling support.
The initialization of the C++ runtime, in turn, requires support for
dynamically allocating memory. Hence, a heap must be available.
This chain of dependencies ultimately results in the need to construct the
entire Genode environment as a side effect of initializing the real stack of
the program.

During the construction of the Genode environment, the program requests its
own CPU, PD, and LOG sessions from its parent.

With the environment constructed, the program is able to interact
with its own PD session and can, in principle, realize the
initialization of the real stack. However, instead of merely allocating
a new RAM dataspace and attaching the dataspace to the address space of the
PD session, a so-called stack area is used. The stack area
is a secondary region map that is attached as a dataspace to the component's
address-space region map.
This way, virtual-memory allocations within the stack area can be
managed manually. I.e., the spaces between the stacks of different threads are
guaranteed to remain free from any attached dataspaces.
The stack area of a component is created as part of the component's PD
session. The environment initialization code requests its region-map
capability via 'Pd_session::stack_area' and attaches it as a managed dataspace
to the component's address space.
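
The benefit of keeping the area below a stack unpopulated can be
illustrated with plain POSIX primitives. This is an analogy for the
guarding effect only, not Genode's stack-area implementation:

```cpp
#include <sys/mman.h>
#include <cstddef>

// Reserve a region for a stack and revoke access to its lowest page.
// A stack overflow then triggers a page fault in the guard page
// instead of silently corrupting adjacent data.
void *allocate_guarded_stack(std::size_t stack_size)
{
    std::size_t const page = 4096;

    char *region = static_cast<char *>(
        mmap(nullptr, stack_size + page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    if (region == MAP_FAILED)
        return nullptr;

    // the lowest page becomes the guard
    if (mprotect(region, page, PROT_NONE) != 0)
        return nullptr;

    // stacks grow downwards: hand out the top of the region
    return region + page + stack_size;
}
```

In Genode, the same effect falls out of the stack area for free: the
ranges between stacks are simply never populated with dataspaces.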


Component-dependent startup code
--------------------------------

With the Genode environment constructed and the initial stack switched
to a proper stack located in the stack area, the component-dependent
startup code of the '_main' function in _repos/base/src/lib/startup/_main.cc_ can be
executed. This code is responsible for calling the global constructors
of the program before calling the program's main function.

In accordance with the established signature of the 'main' function, taking
an argument list and an environment as arguments, the startup code supplies
these arguments but uses dummy default values. However, since the values
are taken from the global variables 'genode_argv', 'genode_argc', and
'genode_envp', a global constructor is able to override the default values.

The startup code in '_main.cc' is accompanied by support for _atexit_
handling. The atexit mechanism allows for the registration of handlers
to be called at the exit of the program. It is provided in the form of
a POSIX API by the C runtime. But it is also used by the compiler to
schedule the execution of the destructors of function-local static objects.
For the latter reason, the atexit mechanism cannot be merely provided
by the (optional) C runtime but must be supported by the base library.
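
The interplay of explicit 'atexit' registration and compiler-scheduled
destruction of function-local statics can be observed in a small
standalone program (a hypothetical example, not code from '_main.cc'):

```cpp
#include <cstdlib>
#include <cstdio>

struct Guard
{
    ~Guard() { std::puts("~Guard, registered first, runs last"); }
};

void use_local_static()
{
    // On the first execution of this line, the compiler registers
    // ~Guard via the atexit mechanism (__cxa_atexit in the Itanium
    // C++ ABI).
    static Guard g;
}

int register_handlers()
{
    use_local_static();

    // Handlers run in reverse order of registration at program exit,
    // so this one executes before ~Guard.
    return std::atexit([] { std::puts("atexit handler, runs first"); });
}
```

Both kinds of handlers end up in the same LIFO list, which is why the
base library must provide the mechanism even without a C runtime.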


C++ runtime
===========

Genode is implemented in C++ and relies on all C++ features required to use
the language in its idiomatic way. This includes the use of exceptions
and runtime-type information.


Rationale behind using exceptions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Compared to return-based error handling as prominently used in C programs, the
C++ exception mechanism is much more complex. In particular, it requires the use
of a C++ runtime library that is called as a back-end by the exception-handling
code generated by the compiler. This library contains the functionality needed to
unwind the stack and a mechanism for obtaining runtime type
information (RTTI). The C++ runtime libraries that come with common tool
chains, in turn, rely on a C library for performing dynamic memory
allocations, string operations, and I/O operations. Consequently, C++ programs
that rely on exceptions and RTTI implicitly depend on a C library. For this
reason, the use of those C++ features is often dismissed for low-level
operating-system code, which usually does not run in an environment where a
complete C library is available.

In principle, C++ can be used without exceptions and RTTI (by passing the
arguments '-fno-exceptions' and '-fno-rtti' to GCC). However, without
those features, it is hardly possible to use the language as designed.

For example, when the operator 'new' is used, it performs two steps:
Allocating the memory needed to hold the to-be-created object and calling
the constructor of the object with the return value of the allocation
as 'this' pointer. In the event that the memory allocation fails, the only
way for the allocator to propagate the out-of-memory condition is to throw
an exception. If no exception were thrown, the constructor would be
called with a null pointer as 'this'.
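
The point can be demonstrated with a class-specific 'operator new' that
simulates an out-of-memory condition (a contrived sketch):

```cpp
#include <cstddef>
#include <new>

static int constructor_calls = 0;

struct Widget
{
    // class-specific allocator that simulates memory exhaustion
    static void *operator new(std::size_t) { throw std::bad_alloc(); }
    static void operator delete(void *) { }

    Widget() { constructor_calls++; }   // never reached on failure
};

// returns true iff the constructor was skipped when allocation failed
bool ctor_skipped_on_alloc_failure()
{
    try { new Widget(); }
    catch (std::bad_alloc &) { }

    return constructor_calls == 0;
}
```

Because the allocation throws, control never reaches the constructor,
which is exactly the guarantee a null-returning allocator cannot give.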

Another example is the handling of errors during the construction of an
object. The object construction may consist of several consecutive
steps such as the construction of base classes and aggregated objects.
If one of those steps fails, the construction of the overall object remains
incomplete. This condition must be propagated to the code that issued the
object construction. There are two principal approaches:

# The error condition can be kept as an attribute in the object. After
  constructing the object, the user of the object may detect the error
  condition by requesting the attribute value.
  However, this approach is plagued by the following problems.

  First, the failure of one step
  may cause subsequent steps to fail as well. In the worst case, if the
  failed step initializes a pointer that is passed to subsequent
  steps, the subsequent steps may use an uninitialized pointer. Consequently,
  the error condition must eventually be propagated to subsequent steps,
  which, in turn, need to be implemented in a defensive way.

  Second, if the construction failed, the object exists but it is inconsistent.
  In the worst case, if the user of the object fails to check for the
  successful construction, it will perform operations on an inconsistent
  object. But even in the good case, where the user detects the
  incomplete construction and decides to immediately destruct the object, the
  destruction is error prone.
  The already performed steps may have had side effects such as resource
  allocations. So it is important to revert all the successful steps by
  invoking their respective destructors. However, when destructing the
  object, the destructors of the incomplete steps are also called.
  Consequently, such destructors need to be implemented in a defensive
  manner to accommodate this situation.

  Third, objects cannot have references that depend on potentially failing
  construction steps. In contrast to a pointer that may be marked as
  uninitialized by being a null pointer, a reference is, by definition,
  initialized once it exists. Consequently, the result of such a step can
  never be passed as reference to subsequent steps. Pointers must be used.

  Fourth, the mere existence of incompletely constructed
  objects introduces many variants of possible failures that need
  to be considered in the code. There may be many different stages of
  incompleteness. Because of the third problem,
  every time a construction step takes the result of a previous step as an
  argument, it explicitly has to consider the error case.
  This, in turn, tremendously inflates the test space of the code.

  Furthermore, there needs to be a convention of how the completion of an
  object is indicated. All programmers have to learn and follow the convention.

# The error condition triggers an exception. Thereby, the object construction
  immediately stops at the erroneous step. Subsequent steps are not
  executed at all. Furthermore, while unwinding the stack, the exception
  mechanism reverts all already completed steps by calling their respective
  destructors. Consequently, the construction of an object can be considered
  as a transaction. If it succeeds, the object is known to be completely
  constructed. If it fails, the object immediately ceases to exist.

Thanks to the transactional semantics of the second variant, the state space
for potential error conditions (and thereby the test space) remains small.
Also, the second variant facilitates the use of references as class members,
which can be safely passed as arguments to subsequent constructors. When
receiving such a reference as argument (as opposed to a pointer), no
validity checks are needed.
Consequently, by using exceptions, the robustness of object-oriented code
(i.e., code that relies on C++ constructors) can be greatly improved over code
that avoids exceptions.
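
The transactional behavior of the second variant can be reproduced in a
few lines (an illustrative sketch with made-up step names):

```cpp
#include <stdexcept>

static int resources_held = 0;

struct Step_1   // a construction step that acquires a resource
{
    Step_1()  { resources_held++; }
    ~Step_1() { resources_held--; }   // reverts the acquisition
};

struct Step_2   // a later step that fails
{
    Step_2() { throw std::runtime_error("step 2 failed"); }
};

struct Compound
{
    Step_1 _step_1;   // completes
    Step_2 _step_2;   // throws: ~Step_1 runs, 'Compound' never exists
};

bool construction_is_transactional()
{
    try { Compound compound; }
    catch (std::runtime_error &) { }

    // stack unwinding reverted the already completed step
    return resources_held == 0;
}
```

No 'Compound' object in an inconsistent state ever becomes visible to
the caller.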


Bare-metal C++ runtime
~~~~~~~~~~~~~~~~~~~~~~

Acknowledging the rationale given in the previous section, there is
still the problem of the complexity added by the exception mechanism.
For Genode, the complexity of the trusted computing base is a fundamental
metric. The C++ exception mechanism with its dependency on the C library
arguably adds significant complexity. The code complexity of a C
library exceeds the complexity of the fundamental components (such as the
kernel, core, and init) by an order of magnitude. Making the fundamental
components depend on such a C library would jeopardize one of Genode's most
valuable assets, which is its low complexity.

To enable the use of C++ exceptions and runtime type information but
avoid the incorporation of an entire C library into the trusted computing
base, Genode comes with a customized C++ runtime that does not depend on
a C library. The C++ runtime libraries are provided by the tool chain,
which interface with the symbols provided by Genode's C++ support code
(_repos/base/src/lib/cxx_).

Unfortunately, the interface used by the C++ runtime does not reside
in a specific namespace but is rather a subset of the POSIX API. When
linking a real C library to a Genode component, the symbols present in the
C library would collide with the symbols present in Genode's C++ support code.
For this reason, the C++ runtime (of the compiler) and Genode's C++
support code are wrapped in a single library (_repos/base/lib/mk/cxx.mk_) in
a way that all POSIX functions remain hidden. All the references of the
C++ runtime are resolved by the C++ support code, both wrapped in the cxx
library. To the outside, the cxx library solely exports the CXA ABI as
required by the compiler.


Interaction of core with the underlying kernel
==============================================

Core is the root of the component tree. It is initialized and started
directly by the underlying kernel and has two purposes. First, it makes
the low-level physical resources of the machine available to other components
in the form of services. These resources are physical memory, processing
time, device resources, initial boot modules, and protection mechanisms (such
as the MMU, IOMMU, and virtualization extensions). It thereby
hides the peculiarities of the used kernel behind an API that is uniform
across all kernels supported by Genode. Core's second purpose is the
creation of the init component by using its own services and following the
steps described in Section [Component creation].

Even though core is executed in user mode, its role as the root of the
component tree makes it as critical as the kernel. It just happens to be
executed in a different processor mode. Whereas regular components solely
interact with the kernel when performing inter-component communication, core
interplays with the kernel more intensely. The following subsections go
into detail about this interplay.

The description tries to be general across the various kernels supported
by Genode. Note, however, that a particular kernel may deviate from the
general description.


System-image assembly
~~~~~~~~~~~~~~~~~~~~~

A Genode-based system consists of potentially many boot modules. But boot
loaders - in particular on ARM platforms - usually support the loading of a
single system image only. To unify the boot procedure across kernels and CPU
architectures, on all kernels except Linux, Genode merges boot modules
together with the core component into a single image.

The core component is actually built as a library. The library
description file is specific for each platform and located at
_lib/mk/spec/<pf>/core.mk_ where _<pf>_ corresponds to the
hardware platform used. It includes the platform-agnostic _lib/mk/core.inc_ file.
The library contains everything core needs (including the C++ runtime and
the core code) except the following symbols:

:'_boot_modules_headers_begin' and '_boot_modules_headers_end':
  Between those symbols, core expects an array of boot-module header
  structures. A boot-module header contains the name, core-local
  address, and size of a boot module. This meta data is used by
  core's initialization code in _src/core/platform.cc_ to populate the ROM
  service with modules.

:'_boot_modules_binaries_begin' and '_boot_modules_binaries_end':
  Between those symbols, core expects the actual module data.
  This range is outside the core image (beyond '_prog_img_end').
  In contrast to the boot-module headers, the modules reside in a
  separate section that remains unmapped within core's virtual address
  space. Only when access to a boot module is required by core (i.e., the
  ELF binary of init during the creation of the init component), core
  makes the module visible within its virtual address space.

  Making the boot modules invisible to core has two benefits. First, the
  integrity of the boot modules does not depend on core. Even in the
  presence of a bug in core, the boot modules cannot be accidentally
  overwritten. Second, no page-table entries are needed to map
  the modules into the virtual address space of core. This is particularly
  beneficial when using large boot modules such as a complete disk image.
  If incorporated into the core image, page-table
  entries for the entire disk image would need to be allocated at
  the initialization time of core.

These symbols are defined in an assembly file called _boot_modules.s_.
When building core stand-alone, the final linking stage combines the
core library with the dummy _boot_modules.s_ file located at
_src/core/boot_modules.s_.
But when using the run tool (Section [Run tool]) to integrate a
bootable system image, the run tool dynamically generates a version of
_boot_modules.s_ depending on the boot modules listed in the run script
and repeats the final linking
stage of core by combining the core library with the generated
_boot_modules.s_ file.
The generated file is placed at _<build-dir>/var/run/<scenario>/_
and incorporates the boot modules using the assembler's '.incbin' directive.
The result of the final linking stage is an executable ELF binary that
contains both core and the boot modules.
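
The header array delimited by '_boot_modules_headers_begin/_end' can be
pictured as follows. The field layout is hypothetical and merely stands
in for core's actual definition:

```cpp
#include <cstddef>

// hypothetical layout of one boot-module header
struct Boot_module_header
{
    char const *name;   // module name for the ROM service
    void       *base;   // core-local address of the module data
    std::size_t size;   // module size in bytes
};

// Core iterates from '_boot_modules_headers_begin' to '_..._end';
// here, 'begin' and 'end' are ordinary pointers standing in for
// those linker symbols.
std::size_t count_modules(Boot_module_header const *begin,
                          Boot_module_header const *end)
{
    std::size_t count = 0;
    for (Boot_module_header const *h = begin; h != end; h++)
        count++;
    return count;
}

std::size_t demo()
{
    static Boot_module_header const headers[] = {
        { "init",   nullptr, 0 },
        { "config", nullptr, 0 } };

    return count_modules(headers, headers + 2);
}
```

Core's initialization code walks this array in the same fashion to
register one ROM module per header.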


Bootstrapping and allocator setup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

At boot time, the kernel passes information about the physical resources and
the initial system state to core. Even though the mechanism and format of this
information varies from kernel to kernel, it generally covers the following
aspects:

* A list of free physical memory ranges
* A list of the physical memory locations of the boot modules along with their
  respective names
* The number of available CPUs
* All information needed to enable the initial thread to perform kernel
  operations


Core's allocators
-----------------

Core's kernel-specific platform initialization code (_core/platform.cc_)
uses this information to initialize the allocators used for keeping track
of physical resources. Those allocators are:

:RAM allocator: contains the ranges of the available physical memory
:I/O memory allocator: contains the physical address ranges of unused
  memory-mapped I/O resources. In general, all ranges not initially present in
  the RAM allocator are considered to be I/O memory.
:I/O port allocator: contains the I/O ports on x86-based platforms that are
  currently not in use. This allocator is initialized with the entire
  I/O port range of 0 to 0xffff.
:IRQ allocator: contains the IRQs that are associated with IRQ sessions.
  This allocator is initialized with the entirety of the available IRQ
  numbers.
:Core-region allocator: contains the virtual memory regions of core that
  are not in use.

The RAM allocator and core-region allocator are subsumed in the so-called
core-memory allocator. In addition to aggregating both allocators, the
core-memory allocator allows for the allocation of core-local virtual-memory
regions that can be used for holding core-local objects. Each region
allocated from the core-memory allocator has to satisfy four conditions:

# It must be backed by a physical memory range (as allocated from the RAM
  allocator)
# It must have a core-local virtual-memory range assigned (as allocated
  from the core-region allocator)
# The physical-memory range must have the same size as the virtual-memory range
# The virtual memory range must be mapped to the physical memory range using
  the MMU

Internally, the core-memory allocator maintains a so-called mapped-memory
allocator that contains ranges of ready-to-use core-local memory. If a new
allocation exceeds the available capacity, the core-memory allocator expands
its capacity by allocating a new physical memory region from the RAM
allocator, allocating a new core-virtual memory region from the core-region
allocator, and installing a mapping from the virtual region to the physical
region.
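
The growth policy can be modeled in a few lines. This is a toy model:
the real core-memory allocator additionally installs the MMU mapping and
keeps chunk remainders available in the mapped-memory allocator, whereas
the sketch simply abandons them.

```cpp
#include <cstddef>
#include <cstdlib>

// Hand out bytes from the current chunk; on exhaustion, obtain a
// fresh chunk from an upstream allocator (standing in for the RAM
// allocator plus the core-region allocator).
class Growing_allocator
{
    static constexpr std::size_t CHUNK_SIZE = 4096;   // one page

    char        *_chunk  = nullptr;
    std::size_t  _used   = CHUNK_SIZE;   // force growth on first use
    int          _chunks = 0;

    public:

        void *alloc(std::size_t size)
        {
            if (size > CHUNK_SIZE)
                return nullptr;

            if (_used + size > CHUNK_SIZE) {   // expand capacity
                _chunk = static_cast<char *>(std::malloc(CHUNK_SIZE));
                _used  = 0;
                _chunks++;
            }
            void *result = _chunk + _used;
            _used += size;
            return result;
        }

        int chunks_allocated() const { return _chunks; }
};

int demo_growth()
{
    Growing_allocator alloc;
    for (int i = 0; i < 3; i++)
        alloc.alloc(2048);   // third allocation exceeds the first chunk

    return alloc.chunks_allocated();
}
```

As in the sketch, expansion happens only when a request exceeds the
remaining capacity, never preemptively.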

All memory allocations mentioned above are performed at the granularity of
physical pages, i.e., 4 KiB.

The core-memory allocator is expanded on demand but never shrunk.
This makes it unsuitable for allocating objects on behalf of core's clients
because allocations could not be reverted when closing a session.
It is solely used for dynamic memory allocations at startup (e.g., the
memory needed for keeping the information about the boot modules),
and for keeping meta data for the allocators themselves.

; Boot modules
; ------------


Kernel-object creation
~~~~~~~~~~~~~~~~~~~~~~

Kernel objects are objects maintained within the kernel and used by the
kernel.
The exact notion of what a kernel object represents depends on the actual
kernel as the various kernels differ with respect to the abstractions they
provide.
Typical kernel objects are threads and protection domains.
Some kernels have kernel objects for memory mappings while others provide
page tables as kernel objects.
Whereas some kernels represent scheduling parameters as distinct kernel
objects, others subsume scheduling parameters to threads.
What all kernel objects have in common, though, is that they consume kernel
memory.
Most kernels of the L4 family preserve a fixed pool of memory for the
allocation of kernel objects.

If an arbitrary component were able to perform a kernel operation that triggers
the creation of a kernel object, the memory consumption of the kernel would
depend on the good behavior of all components. A misbehaving component may
exhaust the kernel memory.

To counter this problem, on Genode, only core triggers the creation of kernel
objects and thereby guards the consumption of kernel memory. Note, however,
that not all kernels are able to prevent the creation of kernel objects
outside of core.


Page-fault handling
~~~~~~~~~~~~~~~~~~~

Each time a thread within the Genode system triggers a page fault, the kernel
reflects the page fault along with the fault information as a message to the
user-level page-fault handler residing in core.
The fault information comprises the identity and instruction pointer of the
faulting thread, the page-fault address, and the fault type (read, write,
execute).
The page-fault handler represents each thread as a so-called _pager object_,
which encapsulates the subset of the thread's interface that is needed to
handle page faults.
For handling the page fault, the page-fault handler first looks up the pager
object that belongs to the faulting thread's identity,
analogously to how an RPC entrypoint looks up the RPC object for an incoming
RPC request.
Given the pager object, the fault is handled by calling the 'pager' function
with the fault information as argument. This function is implemented by
the so-called 'Rm_client' (_repos/base/src/core/region_map_component.cc_),
which represents the association of the pager object
with its virtual address space (region map). Given the context
information about the region map of the thread's PD, the 'pager' function
looks up the region within the region map, on which the page fault occurred.
The lookup results in one of the following three cases:

:Region is populated with a dataspace:
  If a dataspace is attached at the fault address, the backing store of the
  dataspace is determined.
  Depending on the kernel, the backing store
  may be a physical page, a core-local page, or another reference to a physical
  memory page.
  The pager function then installs a memory mapping from the virtual page where
  the fault occurred to the corresponding part of the backing store.

:Region is populated with a managed dataspace:
  If the fault occurred within a region where a managed dataspace is
  attached, the fault handling is forwarded to the region map that
  represents the managed dataspace.

:Region is empty:
  If no dataspace could be found at the fault address, the fault cannot
  be resolved. In this case, core submits a region-map-fault signal to the
  region map where the fault occurred. This way, the region-map client has
  the chance to detect and possibly respond to the fault. Once the signal
  handler receives a fault signal, it is able to query the fault address
  from the region map.
  As a response to the fault, the region-map client may attach a dataspace at
  this address.
  This attach operation, in turn, will prompt core to wake up the thread
  (or multiple threads) that faulted within the attached region.
  Unless a dataspace is attached at the page-fault address, the faulting
  thread remains blocked.
  If no signal handler for region-map faults is registered for the region map,
  core prints a diagnostic message and blocks the faulting thread forever.
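
For illustration, the three-way case distinction can be modeled in plain C++
as follows. This is a simplified sketch; 'Region_map_model' and its types are
invented for this example and do not correspond to core's actual data
structures in _region_map_component.cc_.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>

// Outcome of the pager's lookup at the fault address
enum class Fault_result { RESOLVED, FORWARDED, UNRESOLVED };

struct Region {
    std::uintptr_t base;     // start of the attached range
    std::size_t    size;     // size of the attached range
    bool           managed;  // true if a managed dataspace is attached
};

struct Region_map_model {
    std::map<std::uintptr_t, Region> regions;  // keyed by base address

    Fault_result handle_fault(std::uintptr_t addr)
    {
        // find the last region starting at or below the fault address
        auto it = regions.upper_bound(addr);
        if (it == regions.begin())
            return Fault_result::UNRESOLVED;   // empty: signal the fault
        --it;
        Region const &r = it->second;
        if (addr >= r.base + r.size)
            return Fault_result::UNRESOLVED;   // gap between regions
        if (r.managed)
            return Fault_result::FORWARDED;    // recurse into nested region map
        return Fault_result::RESOLVED;         // install memory mapping
    }
};
```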

To optimize the TLB footprint and the use of kernel memory, region maps
do not merely operate at the granularity of memory pages but on
address ranges whose size and alignment are arbitrary power-of-two values (at
least as large as the size of the smallest physical page).
The source and destination of a memory mapping may span many pages.
This way, depending on the kernel and the architecture, multiple pages may be
mapped at once, or large page-table mappings can be used.
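
The derivation of such larger-than-page mappings can be sketched by the
following hypothetical helper: a mapping of size 2^k requires both the source
and destination addresses to be naturally aligned to 2^k and the range to fit
both windows. The function name and signature are made up for illustration;
they do not appear in core.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// number of trailing zero bits, used to determine natural alignment
static unsigned trailing_zeros(std::uintptr_t v)
{
    unsigned n = 0;
    while (!(v & 1) && n < 63) { v >>= 1; ++n; }
    return n;
}

// largest power-of-two mapping size (as log2) usable for the given
// source/destination addresses and the remaining range sizes
unsigned map_size_log2(std::uintptr_t src, std::uintptr_t dst,
                       std::size_t src_avail, std::size_t dst_avail)
{
    unsigned const align = std::min(trailing_zeros(src), trailing_zeros(dst));
    unsigned size_log2 = 12;  // at least the smallest page (4 KiB)
    while (size_log2 + 1 <= align
        && (std::size_t(1) << (size_log2 + 1)) <= std::min(src_avail, dst_avail))
        ++size_log2;
    return size_log2;
}
```

For example, two 1-MiB-aligned addresses with 1 MiB of room on both sides
allow a single 1 MiB mapping instead of 256 individual 4 KiB mappings.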



;Capability mechanism in depth
;=============================
;
; Capability representation as C++ object
; Marshalling of capabilities as RPC arguments
; Life-time management


Asynchronous notification mechanism
===================================

Section [Asynchronous notifications] introduces asynchronous notifications
(signals) as one of the fundamental inter-component communication mechanisms.
The description covers the semantics of the mechanism but the question of how
the mechanism relates to core and the underlying kernel remains unanswered.
This section complements Section [Asynchronous notifications] with those
implementation details.

Most kernels do not directly support the semantics of asynchronous
notifications as presented in Section [Asynchronous notifications]. As a
reminder, the mechanism has the following features:

* The authority for triggering a signal is represented by a signal-context
  capability, which can be delegated via the common capability-delegation
  mechanism described in
  Section [Capability delegation through capability invocation].

* The submission of a signal is a fire-and-forget operation. The signal
  producer is never blocked.

* On the reception of a signal, the signal handler can obtain the context
  to which the signal refers. This way, it is able to distinguish
  different sources of events.

* A signal receiver can wait or poll for potentially many signal
  contexts.
  The number of signal contexts associated with a single signal receiver is not
  limited.

The gap between this feature set and the mechanisms provided by the underlying
kernel is bridged by core as part of the PD service. This service
plays the role of a proxy between the producers and receivers of signals.
Each component that interacts with signals has a session to this service.

Within core, a signal context is represented as an RPC object. The RPC object
maintains a counter of signals pending for this context. Signal
contexts can be created and destroyed by the clients of the PD service
using the _alloc_context_ and _free_context_ RPC functions. Upon the creation
of a signal context, the PD client can specify an integer value called
_imprint_ with a client-local meaning. Later, on the reception of signals,
the imprint value is delivered along with the signal to enable the
client to tell the contexts of the incoming signals apart. As a result of
the allocation of a new signal context, the client obtains a signal-context
capability. This capability can be delegated to other components using
the regular capability-delegation mechanism.
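
The bookkeeping described above can be sketched as a toy model. Note that
'Pd_service_model' and the plain-number capability are inventions of this
example; they merely mimic the _alloc_context_ / _free_context_ semantics of
core's PD service.

```cpp
#include <cstdint>
#include <map>

using Capability = unsigned;  // stand-in for a signal-context capability

struct Signal_context {
    std::uintptr_t imprint;   // client-local label, delivered with each signal
    unsigned       pending;   // signals submitted but not yet received
};

struct Pd_service_model {
    std::map<Capability, Signal_context> contexts;
    Capability next_cap = 1;

    // create a signal context labeled with a client-chosen imprint
    Capability alloc_context(std::uintptr_t imprint)
    {
        Capability const cap = next_cap++;
        contexts[cap] = Signal_context{ imprint, 0 };
        return cap;
    }

    void free_context(Capability cap) { contexts.erase(cap); }
};
```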


Signal submission
-----------------

A component in possession of a signal-context capability is able to trigger
signals using the _submit_ function
of its PD session. The submit function takes the signal context capability
of the targeted context and a counter value as arguments. The capability as
supplied to the submit function does not need to originate from the called
session. It may have been created and delegated by another component.
Note that even though a signal context is an RPC object, the submission of a
signal is not realized as an invocation of this object. The signal-context
capability is merely used as an RPC function argument. This design accounts
for the fact that signal-context capabilities may originate from untrusted
peers as is the case for servers that deliver asynchronous notifications
to their clients.
A client of such a server supplies a signal-context capability as argument
to one of the server's RPC functions.
An example is the input session interface (Section [Input]) that allows the
client to get notified when new user input becomes available.
A malicious client may specify a capability that was not created via core's
PD service but that instead refers to an RPC object local to the client.
If the submit function was an RPC function of the signal context, the
server's call of the submit RPC function would eventually invoke the
RPC object of the client. This would put the client in a position where
it may block the server indefinitely and thereby make the server unavailable to
all clients. In contrast to the untrusted signal-context capability, the
PD session of a signal producer is by definition trusted. So it is safe
to invoke the submit RPC function with the signal-context capability as
argument. In the case where an invalid signal-context capability is delegated
to the signal producer, core will fail to look up a signal context for the
given capability and omit the signal.
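
The fire-and-forget semantics can be captured in a few lines. In this
simplified sketch (not core's actual implementation), an unknown capability is
silently ignored rather than raising an error, so a malicious or buggy
producer can neither stall nor crash core.

```cpp
#include <map>

using Capability = unsigned;  // stand-in for a signal-context capability

struct Signal_proxy {
    std::map<Capability, unsigned> pending;  // per-context pending counters

    void submit(Capability cap, unsigned cnt)
    {
        auto it = pending.find(cap);
        if (it == pending.end())
            return;              // invalid capability: omit the signal
        it->second += cnt;       // never blocks, never calls back the producer
    }
};
```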


Signal reception
----------------

For receiving signals, a component needs a way to obtain information about
pending signals from core. This involves two steps: First, the component
needs a way to block until signals are available. Second, if a signal is
pending, the component needs a way to determine the signal context and the
signal receiver associated with the signal and wake up the thread that
blocks in the 'Signal_receiver::block_for_signal' API function.

Both problems are solved by a dedicated thread that is spawned during
component startup. This signal thread blocks at core's PD
service for incoming signals. The blocking operation is not directly
performed on the PD session but on a decoupled RPC object called
_signal source_.
In contrast to the PD session interface that is kernel agnostic, the
underlying kernel mechanism used for blocking
the signal thread at the signal source depends on the used base
platform.

The signal-source RPC object implements an RPC interface, on which the PD
client calls the blocking _wait_for_signal_ RPC function.
This function blocks as long as no signal that refers to the session's signal
contexts is pending. If the function returns, the return value contains the
imprint that was assigned to the signal context at its creation and
the number of signals pending for this context.
On most base platforms, the implementation of the blocking RPC interface is
realized by processing RPC requests and responses out of order to enable one
entrypoint in core to serve all signal sources. Core uses a dedicated
entrypoint for the signal-source handling to decouple the delivery of signals
from potentially long-running operations of the other core services.

Given the imprint value returned by the signal source, the signal thread
determines the signal context and signal receiver that belong to the pending
signal (using a data structure called 'Signal_context_registry') and locally
submits the signal to the signal-receiver object. This, in turn, unblocks the
'Signal_receiver::block_for_signal' function at the API level.
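
The component-local dispatch step can be sketched as follows. The class names
are illustrative stand-ins, not Genode's actual 'Signal_context_registry'
implementation: given the imprint returned by the signal source, the matching
context is looked up and the signal is counted locally.

```cpp
#include <cstdint>
#include <map>

struct Local_context { unsigned received = 0; };

struct Context_registry {
    std::map<std::uintptr_t, Local_context *> by_imprint;

    // deliver 'num' pending signals for the context labeled 'imprint';
    // returns false if the imprint is unknown
    bool dispatch(std::uintptr_t imprint, unsigned num)
    {
        auto it = by_imprint.find(imprint);
        if (it == by_imprint.end())
            return false;
        it->second->received += num;  // unblocks the waiting API-level thread
        return true;
    }
};
```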


Parent-child interaction in detail
==================================

On a conceptual level, the session-creation procedure as described in
Section [Services and sessions] appears as a synchronous interaction
between the parent and its child components. The interaction serves three
purposes. First, it is used to communicate information between different
protection domains, in this case the parent, the client, and the server.
Second, it implicitly dictates the flow of control between the involved
parties because the caller blocks until the callee replies.
Third, the interplay delegates authority (in particular authority to
access the server's session object) between protection domains. The latter is
realized with the kernel's ability to carry capabilities as IPC message
payload.

[tikz img/async_session_seq]
  Parent-child interplay during the creation of a new session.
  The dotted lines are asynchronous notifications, which have fire-and-forget
  semantics. A component that triggers a signal does not block.

On the surface, the interaction looks like a sequence of synchronous RPC
calls. However, under the hood, the interplay between the parent and its
children is based on a combination of asynchronous notifications from
the parent to the children and synchronous RPC from the children to the
parent. The protocol is designed such that the parent's liveness remains
independent from the behavior of its children, which must generally be
regarded as untrusted from the parent's perspective. The sequence of creating
a session is depicted in Figure [img/async_session_seq].
The following points are worth noting:

* Sessions are identified via IDs, which are plain numbers as opposed to
  capabilities. The IDs as seen by the client and server belong to different
  ID name spaces.
  IDs of sessions requested by the client are allocated by the client. IDs
  of sessions requested at the server are allocated by the parent.
* The parent does not issue RPC calls to any of its children.
* Each activation of the parent merely applies a state change of the session's
  meta data structures maintained at the parent, which capture the entire
  state of session requests.
* The information about pending session requests is communicated from the
  parent to the server via a ROM session. At startup, the server requests
  a ROM session for the ROM module "session_requests" from its parent. The
  parent implements this ROM session locally. Since ROM sessions support
  versions, the parent can post version updates of the "session_requests"
  ROM with the regular mechanisms already present in Genode.
* The parties involved can potentially run in parallel.
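
The session meta data maintained at the parent can be pictured as a small
state machine. The following sketch uses invented names; the state names and
transitions only illustrate the life cycle implied by the protocol above.

```cpp
// life cycle of one session request as tracked by the parent
enum class Session_state { CREATE_REQUESTED, AVAILABLE,
                           CLOSE_REQUESTED, CLOSED };

struct Session_request {
    unsigned      client_id;  // ID in the client's name space, client-allocated
    unsigned      server_id;  // ID in the server's name space, parent-allocated
    Session_state state = Session_state::CREATE_REQUESTED;

    // server picked up the "session_requests" ROM and delivered the session
    void session_created() { state = Session_state::AVAILABLE; }

    // client asked to close; parent posts a new ROM version to the server
    void close_requested() { state = Session_state::CLOSE_REQUESTED; }

    // server confirmed the destruction of the session
    void session_closed()  { state = Session_state::CLOSED; }
};
```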


Dynamic linker
==============

The dynamic linker is a mechanism for loading ELF binaries that are
dynamically-linked against shared libraries.


Building dynamically-linked programs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The build system automatically decides whether a program is linked statically
or dynamically depending on the use of shared libraries. If the target
is linked against at least one shared library, the resulting ELF image
is a dynamically-linked program. Almost all Genode components are linked
against the Genode application binary interface (ABI), which is a shared
library. Therefore, components are dynamically-linked programs unless a
kernel-specific base library is explicitly used.

The entrypoint of a dynamically-linked program is the 'Component::construct'
function.


Startup of dynamically-linked programs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When creating a new component,
the parent first detects whether the to-be-loaded ELF binary represents
a statically-linked program or a dynamically-linked program by inspecting
the ELF binary's program-header information (see
_repos/base/src/lib/base/elf_binary.cc_).
If the program is statically linked, the parent follows the procedure as
described in Section [Component creation]. If the program is dynamically
linked, the parent remembers the dataspace of the program's ELF image but
starts the ELF image of the dynamic linker instead.

The dynamic linker is a regular Genode component that follows the startup
procedure described in Section [Startup code]. However, because of its
hybrid nature, it needs to take special precautions before using any
data that contains relocations. Because the dynamic linker is a shared
library, it contains data relocations. Even though the linker's code is
position independent and can in principle be loaded at an arbitrary address,
global data objects may contain pointers to other global data objects or
code. For example, vtable entries contain pointers to code. Those pointers
must be relocated depending on the load address of the binary. This step is
performed by the 'init_rtld' hook function, which was already mentioned in
Section [Startup code]. Global data objects must not be used before calling
this function. For this reason, 'init_rtld' is called at the earliest possible
time directly from the assembly startup code.
Apart from the call of this hook function, the startup of the dynamic linker
is the same as for statically-linked programs.

The main function of the dynamic linker obtains the binary of the actual
dynamically-linked program by requesting a ROM session for the module
"binary". The parent responds to this request by handing out a
locally-provided ROM session that contains the dataspace of the actual
program. Once the linker has obtained the dataspace containing the
dynamically-linked program, it loads the program and all required shared
libraries. The dynamic linker requests each shared library as a ROM
session from its parent.

After completing the loading of all ELF objects, the dynamic linker determines
the entry point of the loaded binary by looking up the 'Component::construct'
symbol and calls it as a function. Note that this particular symbol is
ambiguous as both the dynamic linker and the loaded program have such a
function. Hence, the lookup is performed explicitly on the loaded program.


Address-space management
~~~~~~~~~~~~~~~~~~~~~~~~

To load the binary and the associated shared libraries, the linker does not
directly attach dataspaces to its address space. Instead, it manages a dedicated
part of the component's virtual address space called _linker area_ manually.
The linker area is a region map that is created as part of a PD session.
The dynamic linker attaches the linker area as a managed dataspace to its
address space. This way, the linker can precisely
control the layout within the virtual-address range covered by the managed
dataspace. This control is needed because the loading of an ELF object does
not correspond to an atomic attachment of a single dataspace but it involves
consecutive attach operations for multiple dataspaces, one for each ELF
segment. When attaching one segment, the linker must make sure that there is
enough space beyond the segment to host the next segment. The use of a managed
dataspace allows the linker to manually allocate large-enough portions of
virtual memory and populate them in multiple steps.
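
The step-wise population of a pre-reserved window can be sketched as follows.
'Linker_area_model' is a made-up class for illustration; the real linker
operates on a region map via the PD session rather than on plain offsets.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

struct Linker_area_model {
    std::size_t const window_size;  // pre-reserved virtual range
    std::size_t       used = 0;     // next free offset within the window
    std::vector<std::pair<std::size_t, std::size_t>> segments;  // (offset, size)

    explicit Linker_area_model(std::size_t size) : window_size(size) { }

    // attach one ELF segment; because the whole window was reserved up
    // front, consecutive segments are guaranteed to find room
    std::size_t attach_segment(std::size_t size)
    {
        if (used + size > window_size)
            throw std::runtime_error("linker area exhausted");
        std::size_t const offset = used;
        segments.emplace_back(offset, size);
        used += size;
        return offset;
    }
};
```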


Execution on bare hardware (base-hw)
====================================

The code specific to the base-hw platform is located within the
_repos/base-hw/_ directory. In the following description, unless explicitly
stated otherwise, all paths are relative to this directory.

In contrast to classical L4 microkernels where Genode's core process runs as
user-level roottask on top of the kernel, base-hw executes Genode's core
directly on the hardware with no distinct kernel underneath. Core and the
kernel are merged into one hybrid component. Although all threads of core are
running in privileged processor mode, they call a kernel library to synchronize
hardware interaction. However, most work is done outside of that library. This
design has several benefits. First, the kernel part becomes much simpler. For
example, there are no allocators needed within the kernel. Second, base-hw side-steps
long-standing difficult kernel-level problems, in particular the management of kernel
resources. For the allocation of kernel objects, the hybrid core/kernel can
employ Genode's user-level resource trading concepts as described in Section
[Resource trading]. Finally and most
importantly, merging the kernel with roottask removes a lot of
redundancies between both programs. Traditionally, both kernel and roottask
perform the bookkeeping of physical-resource allocations and track the existence
of kernel objects such as address spaces and threads. In base-hw, those data
structures exist only once. The complexity of the combined kernel/core is
significantly lower than the sum of the complexities of a traditional
self-sufficient kernel and a distinct roottask on top. This way, base-hw helps
to make Genode's TCB less complex.

The following subsections detail the problems that base-hw had to address
to become a self-sufficient base platform for Genode.


Bootstrapping of base-hw
~~~~~~~~~~~~~~~~~~~~~~~~

;Further topic: The bootstrap component
;               -----------------------
;
; * solves the problem of different code linkage before/after MMU enablement
; * starts with disabled MMU, collects all platform information
; * provides platform information including all available mappings like
;   kernel devices and binary in a unified fashion via: _Hw::Boot_info_
; * initializes all kernel-related hardware
; * may lock protection related hardware like MMU, TrustZone,
;   virtualization-specific registers if provided
; * security-sensitive initialization code is not available to kernel/core
;   afterwards


Startup of the base-hw kernel
-----------------------------

Core on base-hw uses Genode's regular linker script. Like any
regular Genode component, its execution starts at the '_start' symbol.
But unlike a regular component, core is started by the bootstrap component as
a kernel running in privileged mode. Instead of directly following the startup
procedure described in Section [Startup code], base-hw uses custom startup code
that initializes the kernel part of core first. For example, the startup code
for the ARM architecture is located at _src/core/spec/arm/crt0.s_.
It calls the kernel initialization code in _src/core/kernel/init.cc_.
Core's regular C++ startup code (the '_main' function) is executed by the first
thread created by the kernel (see the thread setup in the
'Core_thread::Core_thread()' constructor).


Kernel entry and exit
~~~~~~~~~~~~~~~~~~~~~

The execution model of the kernel can be roughly characterized as a
single-stack kernel. In contrast to traditional L4 kernels that maintain one
kernel thread per user thread, the base-hw kernel is a mere state machine
that never blocks in the kernel. State transitions are triggered by
core or user-level threads that enter the kernel via a system call, by device
interrupts, or by a CPU exception. Once entered, the kernel applies the state
change depending on the event that caused the kernel entry, and leaves the
kernel again. The transition between normal threads and kernel execution
depends on the concrete architecture. For ARM, the corresponding code is located
at _src/core/spec/arm/exception_vector.s_.


Interrupt handling and preemptive multi-threading
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to respond to interrupts, base-hw has to contain a driver for
the interrupt controller. The interrupt-controller driver for
a particular hardware platform can be found at _src/core/spec/<spec>/pic.h_
and the corresponding _src/core/spec/<spec>/pic.cc_, where _<spec>_
refers to a particular platform (e.g., imx53) or an IP block that is
used across different platforms (e.g., arm_gic for ARM's generic
interrupt controller).
Each of the drivers implements the same interface. When building core,
the build system uses the build-spec mechanism explained in
Section [Build system] to incorporate the single driver needed for the
targeted SoC.

To support preemptive multi-threading, base-hw requires a hardware timer.
The timer is programmed with the time slice length of the currently
executed thread. Once the programmed timeout elapses, the timer device
generates an interrupt that is handled by the kernel. Similarly to
interrupt controllers, there exists a variety of different timer devices
for different CPUs. Therefore, base-hw contains different timer drivers.
The timer drivers are located at _src/core/spec/<spec>/timer_driver.h_
where _<spec>_ refers to the timer variant.

The in-kernel handler of the timer interrupt invokes the thread scheduler
(_src/core/kernel/cpu_scheduler.h_).
The scheduler maintains a list of so-called scheduling contexts where each
context refers to a thread. Each time the kernel is entered, the scheduler
is updated with the passed duration. When updated, it takes a scheduling
decision by making the next to-be-executed thread the head of the list.
At kernel exit, the control is passed to the user-level thread that
corresponds to the head of the scheduler list.


Split kernel interface
~~~~~~~~~~~~~~~~~~~~~~

The system-call interface of the base-hw kernel is split into two parts.
One part is usable by all components and solely contains system calls for
inter-component communication and thread synchronization. The definition
of this interface is located at _include/kernel/interface.h_. The second
part is exposed only to core. It supplements the public interface with
operations for the creation, the management, and the destruction of kernel
objects. The definition of the core-private interface is located at
_src/core/kernel/core_interface.h_.

The distinction between both parts of the kernel interface is enforced
by the function 'Thread::_call' in _src/core/kernel/thread.cc_.


Public part of the kernel interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Threads do not run independently but interact with each other via synchronous
inter-component communication as detailed in Section
[Inter-component communication]. Within base-hw, this mechanism is referred
to as IPC (for inter-process communication).
To allow threads to perform calls to other threads or to receive RPC requests,
the kernel interface is equipped with system calls for performing IPC
(_send_request_msg_, _await_request_msg_, _send_reply_msg_).
To keep the kernel as simple as possible, IPC is performed using so-called
user-level thread-control blocks (UTCB).
Each thread has a corresponding memory page that is always
mapped in the kernel. This UTCB page is used to carry IPC payload. The largely
simplified procedure of transferring a message is as follows. (In reality, the
state space is more complex because the receiver may not be in a blocking state
when the sender issues the message.)

# The sender marshals its payload into its UTCB and invokes the kernel,
# The kernel transfers the payload from the sender's UTCB to the receiver's
  UTCB and schedules the receiver,
# The receiver retrieves the incoming message from its UTCB.

Because all UTCBs are always mapped in the kernel, no page faults can occur
during the second step. This way, the flow of execution within the kernel
becomes predictable and no kernel exception handling code is needed.
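
The three steps can be modeled in a few lines of plain C++. The model below is
illustrative only: because both UTCBs are permanently mapped in the kernel,
the copy in step two reduces to a plain memory copy that cannot fault.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

constexpr std::size_t UTCB_SIZE = 4096;  // one memory page per thread
using Utcb = std::array<char, UTCB_SIZE>;

struct Thread_model {
    Utcb        utcb { };    // always mapped in the kernel
    std::size_t msg_len = 0;
};

// step 2, performed inside the kernel: copy the payload from the sender's
// UTCB to the receiver's UTCB - no page fault can occur here
void transfer_msg(Thread_model const &sender, Thread_model &receiver)
{
    std::size_t const len = std::min(sender.msg_len, UTCB_SIZE);
    std::copy_n(sender.utcb.begin(), len, receiver.utcb.begin());
    receiver.msg_len = len;
}
```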

In addition to IPC, threads interact via the synchronization primitives
provided by the Genode API. To implement these portions of the API, the kernel
provides system calls for managing the execution control of threads
(_stop_thread_, _restart_thread_, _yield_thread_).

To support asynchronous notifications as described in Section
[Asynchronous notifications], the kernel provides system calls for the
submission and reception of signals (_await_signal_, _cancel_next_await_signal_,
_submit_signal_, _pending_signal_, and _ack_signal_) as well as the life-time
management of signal contexts (_kill_signal_context_). In contrast to other
base platforms, Genode's signal API is directly supported by the kernel
so that the propagation of signals does not require any interaction with
core's PD service.
However, the creation of signal contexts is arbitrated by the PD service.
This way, the kernel objects needed for the signalling mechanism are
accounted to the corresponding clients of the PD service.

The kernel provides an interface to make the kernel's scheduling timer
available as time source to the user land. Using this interface,
components can bind signal contexts to timeouts (_timeout_) and
follow the progress of time (_time_ and _timeout_max_us_).


Core-private part of the kernel interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The core-private part of the kernel interface allows core to perform
privileged operations. Note that even though the kernel and core provide
different interfaces, both are executed in privileged CPU mode, share
the same address space and ultimately trust
each other. The kernel is regarded as a mere support library of core that
executes those functions that shall be synchronized between different
CPU cores and core's threads. In particular, the kernel does not perform
any allocation. Instead, the allocation of kernel objects is performed as
an interplay of core and the kernel.

# Core allocates physical memory from its physical-memory allocator.
  Most kernel-object allocations are performed in the context of one
  of core's services. Hence, those allocations can be properly accounted
  to a session quota (Section [Resource trading]). This way, kernel objects
  allocated on behalf of core's clients are "paid for" by those clients.

# Core allocates virtual memory to make the allocated physical memory visible
  within core and the kernel.

# Core invokes the kernel to construct the kernel object at the location
  specified by core. This kernel invocation is actually a system call that
  enters the kernel via the kernel-entry path.

# The kernel initializes the kernel object at the virtual address specified
  by core and returns to core via the kernel-exit path.
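
The division of labor sketched in the four steps amounts to the pattern below:
core provides the memory, and the kernel merely constructs the object in
place. The names are invented for illustration and the host allocator stands
in for core's physical-memory and virtual-memory allocators.

```cpp
#include <cstddef>
#include <new>

struct Kernel_object {
    unsigned id;
    explicit Kernel_object(unsigned i) : id(i) { }
};

// steps 1 and 2 (core): allocate and map backing memory, accounted to the
// session quota of the client on whose behalf the object is created
void *core_alloc(std::size_t size) { return ::operator new(size); }

// steps 3 and 4 (kernel): construct the object at the core-specified
// location via placement new - the kernel itself performs no allocation
Kernel_object *kernel_new_obj(void *at, unsigned id)
{
    return new (at) Kernel_object(id);
}
```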

The core-private kernel interface consists of the following operations:

* The creation and destruction of protection domains
  (_new_pd_, _update_pd_, _delete_pd_), invoked by the PD service
* The creation, manipulation, and destruction of threads
  (_new_thread_, _start_thread_, _resume_thread_, _thread_quota_,
  _pause_thread_, _delete_thread_, _thread_pager_, and _cancel_thread_blocking_),
  used by the CPU service
  and the core-specific back end of the 'Genode::Thread' API
* The creation and destruction of signal receivers and signal contexts
  (_new_signal_receiver_, _delete_signal_receiver_, _new_signal_context_, and
  _delete_signal_context_), invoked by the PD service
* The creation and destruction of kernel-protected object identities
  (_new_obj_, _delete_obj_)
* The creation, manipulation, and destruction of interrupt kernel objects
  (_new_irq_, _ack_irq_, and _delete_irq_)
* The mechanisms needed to transfer the flow of control between virtual
  machines and virtual-machine monitors (_new_vm_, _delete_vm_, _run_vm_,
  _pause_vm_)


Scheduler of the base-hw kernel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

CPU scheduling in traditional L4 microkernels is based on static priorities.
The scheduler always picks the runnable thread with highest priority for
execution.
If multiple threads share one priority, the kernel schedules those threads
in a round-robin fashion.
While this scheme is fast and easy to implement, it has disadvantages:
First, there is no way to prevent
high-priority threads from starving lower-priority ones. Second, CPU time
cannot be granted to threads and passed between them by means of quota.
To cope with these problems without much loss of performance, base-hw employs
a custom scheduler that deviates from the traditional approach.

The base-hw scheduler introduces the distinction between high-throughput-oriented
scheduling contexts - called _fills_ - and low-latency-oriented
scheduling contexts - called _claims_. Examples for typical fills would be
the processing of a compiler job or the rendering computations of a sophisticated
graphics program. They shall obtain as much CPU time as the system can spare
but there is no demand for a high responsiveness. In contrast, an example
for the claim category would be a typical GUI-software stack covering the
control flow from user-input drivers through a chain of GUI components to the
drivers of the graphical output. Another example is a user-level device driver
that must quickly respond to sporadic interrupts but is otherwise untrusted.
The low latency of such components is a key factor for usability and
quality of service. Besides introducing the distinction between claim and fill
scheduling contexts, base-hw introduces the notion of a so-called
_super period_, which is a multiple of typical scheduling time slices, e.g.,
one second. The entire super period
corresponds to 100% of the CPU time of one CPU. Portions of it can be assigned
to scheduling contexts. A CPU quota thereby corresponds to a percentage of the
super period.

At the beginning of a super period, each claim has its full amount of assigned
CPU quota. The priority defines the absolute scheduling order within the super
period among those claims that are active and have quota left. As long as
there exist such claims, the scheduler stays in the claim mode and the quota
of the scheduled claims decreases. At the end of a super period, the quota of
all claims is replenished to the initial value. Every time the scheduler cannot
find an active claim with CPU quota left, it switches to the fill mode. Fills
are scheduled in a simple round-robin fashion with identical time slices. The
progression of the super period does not affect the scheduling order and
time slices of this mode. The concept of quota and priority that is
implemented through the claim mode aligns nicely with Genode's way of
hierarchical resource management: Through CPU sessions, each component can
assign portions of its CPU time and subranges of its priority band to
its children without knowing the global meaning of CPU time or priority.
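
The claim/fill decision can be condensed into the following heavily simplified
model. It is a sketch of the policy only; the real scheduler resides in
_src/core/kernel/cpu_scheduler.h_ and the names below are invented.

```cpp
#include <cstddef>
#include <vector>

struct Claim { unsigned prio; unsigned quota_left; };

struct Scheduler_model {
    std::vector<Claim> claims;        // latency-oriented contexts
    std::size_t        num_fills;     // number of throughput-oriented contexts
    std::size_t        next_fill = 0; // round-robin position among fills

    explicit Scheduler_model(std::size_t fills) : num_fills(fills) { }

    // returns the index of the scheduled claim, or -1 if a fill is scheduled
    int take_decision()
    {
        // among active claims with quota left, pick the highest priority
        int best = -1;
        for (std::size_t i = 0; i < claims.size(); ++i)
            if (claims[i].quota_left > 0
             && (best < 0 || claims[i].prio > claims[std::size_t(best)].prio))
                best = int(i);

        if (best >= 0) {
            claims[std::size_t(best)].quota_left--;  // consume claim quota
            return best;
        }
        if (num_fills)                               // quota exhausted:
            next_fill = (next_fill + 1) % num_fills; // round-robin among fills
        return -1;
    }

    // invoked at the end of each super period
    void replenish(unsigned quota)
    {
        for (Claim &c : claims)
            c.quota_left = quota;
    }
};
```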


Sparsely populated core address space
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Even though core has the authority over all physical memory, it has no
immediate access to the physical pages. Whenever core requires access to a
physical memory page, it first has to explicitly map the physical page into
its own virtual memory space. This way, the virtual address space of core
stays clean from any data of other components. Even in the presence of a bug
in core (e.g., a dangling pointer), information cannot accidentally leak
between different protection domains because the virtual memory of the other
components is not necessarily visible to core.


Multi-processor support of base-hw
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On uniprocessor systems, the base-hw kernel is single-threaded. Its
execution model corresponds to a mere state machine.
On SMP systems, it maintains one kernel thread and one scheduler per CPU core.
Access to kernel
objects is fully serialized by one global spin lock that is acquired
when entering the kernel and released when leaving the kernel. This keeps the
use of multiple cores transparent to the kernel model, which greatly
simplifies the code compared to traditional L4 microkernels. Given
that the kernel is a simple state machine providing lightweight non-blocking
operations, there is little contention for the global kernel
lock. Even though this claim may not hold up when scaling to a large number of
cores, current platforms can be accommodated well.


Cross-CPU inter-component communication
---------------------------------------

Regarding synchronous and asynchronous inter-processor communication - thanks
to the global kernel lock - there is no semantic difference to the uniprocessor
case. The only difference is that on a multiprocessor system, one processor may
change the schedule of another processor by unblocking one of its threads
(e.g., when an RPC call is received by a server that resides on a different CPU
as the client).
This condition may invalidate the current scheduling choice of the other
processor. To avoid delays in this case, the kernel signals the unaware target
processor via an inter-processor interrupt (IPI).
The targeted processor can react to the IPI by taking the decision to
schedule the receiving thread.
As the IPI sender does not have to wait for an answer, the sending and
receiving CPUs remain largely decoupled.
There is no need for a complex IPI protocol between sender and receiver.


TLB shootdown
-------------

With respect to the synchronization of core-local hardware, there are two
different situations to deal with. Some hardware components like most ARM
caches and branch predictors implement their own coherence protocol and thus
need adaption in terms of configuration only. Others, like the TLBs lack this
feature. When for instance a page table entry gets invalid, the TLB invalidation
of the affected entries must be performed locally by each core. To signal the
necessity of TLB maintenance work, an IPI is sent to all other cores. Once all
cores have completed the cleaning, the thread that invoked the TLB invalidation
resumes its execution.

; Further possible topics
; * TrustZone
; * kernel bootstrap

Asynchronous notifications on base-hw
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The base-hw platform improves the mechanism described in Section
[Asynchronous notification mechanism] by introducing signal receivers and
signal contexts as first-class kernel objects. Core's
PD service is merely used to arbitrate the creation and destruction of
those kernel objects but it does not play the role of a signal-delivery proxy.
Instead, signals are communicated directly by using the public kernel
operations _await_signal_, _cancel_next_await_signal_, _submit_signal_, and
_ack_signal_.


Execution on the NOVA microhypervisor (base-nova)
=================================================

NOVA is a so-called microhypervisor, denoting the combination of a microkernel
and a virtualization platform (hypervisor). It is a high-performance
microkernel for the x86 architecture. In contrast to other microkernels,
it has been designed for hardware-based virtualization via user-level
virtual-machine monitors. In line with Genode's architecture, NOVA's kernel
interface is based on capability-based security. Hence, the kernel fully
supports the model of a Genode kernel as described in Section
[Capability-based security].

:NOVA website:

  [http://hypervisor.org]

:NOVA kernel-interface specification:

  [https://github.com/udosteinberg/NOVA/raw/master/doc/specification.pdf]


Integration of NOVA with Genode
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The NOVA kernel is available via Genode's ports mechanism detailed in
Section [Integration of 3rd-party software]. The port description is located
at _repos/base-nova/ports/nova.port_.

Building the NOVA kernel
------------------------

Even though NOVA is a third-party kernel with a custom build system,
the kernel is built directly by the Genode build system. NOVA's build
system remains unused.

From within a Genode build directory configured for one of the nova_x86_32
or nova_x86_64 platforms, the kernel can be built via

! make kernel

The build description for the kernel is located at
_repos/base-nova/src/kernel/target.mk_.

System-call bindings
--------------------

NOVA is not accompanied by bindings to its kernel interface. There
is only a description of the kernel interface in the form of the kernel
specification. For this reason, Genode maintains the kernel
bindings for NOVA within the Genode source tree. The bindings are located
at _repos/base-nova/include/_ in the subdirectories _nova/_, _spec/32bit/nova/_,
and _spec/64bit/nova/_.


Bootstrapping of a NOVA-based system
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After finishing its initialization, the kernel starts the second boot module,
the first being the kernel itself, as root task. The root task is Genode's core.
The virtual address space of core contains the text and data segments of core, the
UTCB of the initial execution context (EC), and the hypervisor info page (HIP).
Details about the HIP are provided in Section 6 of the NOVA specification.

; XXX: UTCB and EC have not yet been introduced. Maybe the NOVA term glossary
;       could be moved up to the beginning of this section?

BSS section of core
-------------------

The kernel's ELF loader does not support the concept of a BSS segment. It
simply maps the physical pages of core's text and data segments into
the virtual memory of core but does not allocate any additional physical
pages for backing the BSS. For this reason, the NOVA version of core
does not use the _genode.ld_ linker script as described in Section
[Linker scripts] but the linker script located at
_repos/base-nova/src/core/core.ld_. This version hosts the BSS section
within the data segment. Thereby, the BSS is physically present in the core
binary in the form of zero-initialized data.
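The idea can be illustrated with a hypothetical linker-script fragment (not the actual contents of _repos/base-nova/src/core/core.ld_) that places the BSS sections within the output data segment, so their zero-initialized content occupies file space in the binary:

! .data :
! {
!   *(.data .data.*)
!
!   /* host the BSS within the data segment so that its content is
!      physically present in the binary as zero-initialized data */
!   _bss_start = .;
!   *(.bss .bss.*) *(COMMON)
!   _bss_end = .;
! }

With this placement, the kernel's simple ELF loader needs no special handling for a BSS segment because no anonymous physical pages have to be allocated.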

Initial information provided by NOVA to core
--------------------------------------------

The kernel passes a pointer to the HIP to core as the initial value of the
ESP register. Genode's startup code saves this value in the global variable
'_initial_sp' (Section [Startup code]).


Log output on modern PC hardware
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Because transmitting information over legacy comports does not require
complex device drivers, serial output over comports is still the predominant
way to output low-level system logs like kernel messages or the output of
core's LOG service.

Unfortunately, most modern PCs lack dedicated comports. This leaves two
options to obtain low-level system logs.

# The use of vendor-specific platform-management features such as Intel
  vPro / Intel Active Management Technology (AMT) or the Intelligent
  Platform Management Interface (IPMI). These platform features are able to
  emulate a legacy comport and provide the serial output over the network.
  Unfortunately, those solutions are not uniform across different vendors,
  difficult to use, and tend to be unreliable.

# The use of a PCI card or an Express Card that provides a physical comport.
  When using such a device, the added comport appears as PCI I/O resource.
  Because the device interface is compatible with the legacy comports,
  no special drivers are needed.

The latter option allows the retrieval of low-level system logs on hardware
that lacks special management features.
In contrast to the legacy comports, however, it has the minor disadvantage
that the location of the device's I/O resources is not known beforehand.
The I/O port range of the comport depends on the device-enumeration
procedure of the BIOS. To enable the kernel to output information
over this comport, the kernel must be configured with the I/O port range
as assigned by the BIOS on the specific machine. One kernel binary
cannot simply be used across different machines.

The Bender chain boot loader
----------------------------

To alleviate the need to adapt the kernel configuration to the used comport
hardware, the bender chain boot loader can be used.

:Bender is part of the MORBO tools:

  [https://github.com/TUD-OS/morbo]

Instead of starting the NOVA hypervisor directly, a multi-boot-compliant
boot loader (such as GRUB) starts bender as the kernel. All remaining
boot modules including the real kernel have already been loaded into memory
by the original boot loader. Bender scans the PCI bus for a comport device.
If such a device is found (e.g., an Express Card), it writes the information
about the device's I/O port range to a known offset within the BIOS data
area (BDA).

After the comport-device probing is finished, bender passes control to the
next boot module, which is the real kernel. The comport device driver of
the kernel does not use a hard-coded I/O port range for the comport but
looks up the comport location in the BDA.
The use of bender is optional. When not used, the BDA always contains the I/O
port range of the legacy comport 1.

The Genode source tree contains a pre-compiled binary of bender at
_tool/boot/bender_. This binary is automatically incorporated into boot images
for the NOVA base platform when the run tool (Section [Run tool]) is used.


Relation of NOVA's kernel objects to Genode's core services
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For the terminology of NOVA's kernel objects, refer to the NOVA specification
mentioned in the introduction of
Section [Execution on the NOVA microhypervisor (base-nova)].
A brief glossary for the terminology used in the remainder of this section is
given in table [nova_terminology].

   NOVA term  | Description
  --------------------------------------------------
   PD         | Protection domain
   EC         | Execution context (thread)
   SC         | Scheduling context
   HIP        | Hypervisor information page
   IDC        | Inter-domain call (RPC call)
   portal     | Communication endpoint

[table nova_terminology]
  Glossary of NOVA's terminology


NOVA capabilities are not Genode capabilities
---------------------------------------------

Both NOVA and Genode use the term "capability". However, the term does not have
the same meaning in both contexts. A Genode capability refers to an RPC
object or a signal context. In the context of NOVA, a capability refers to
a NOVA kernel object. To avoid confusing both meanings of the term,
Genode refers to NOVA's term as "capability selector", or simply
"selector". A Genode signal context capability corresponds to a NOVA semaphore,
all other Genode capabilities correspond to NOVA portals.


PD service
----------

A PD session corresponds to a NOVA PD.

A Genode capability that refers to a NOVA portal has a
defined IP and an associated local EC (the Genode entrypoint). The invocation
of such a Genode capability is an IDC call to a portal. A Genode capability is
delegated by passing its corresponding portal or semaphore selector as IDC argument.

Page faults are handled as explained in Section
[Page-fault handling on NOVA]. Each memory mapping installed in a component
implicitly triggers the allocation of a node in the kernel's mapping
database.


CPU service
-----------

NOVA distinguishes between so-called global ECs and local ECs. A global EC can
be equipped with CPU time by associating it with an SC. It can perform
IDC calls but it cannot receive IDC calls. In contrast to a global EC,
a local EC is able to receive IDC calls but it has no CPU time. A local
EC is not executed before it is called by another EC.

A regular Genode thread is a global EC. A Genode entrypoint is a local EC.
Core distinguishes both cases based on the instruction-pointer (IP) argument
of the CPU session's start function. For a local EC, the IP is set to zero.


IO_MEM service
--------------

Core's RAM and IO_MEM allocators are initialized based on the information found
in NOVA's HIP.


ROM service
-----------

Core's ROM service provides all boot modules as ROM modules. Additionally,
a copy of NOVA's HIP is provided as a ROM module named "hypervisor_info_page".


IRQ service
-----------

NOVA represents each interrupt as a semaphore created by the kernel. By
registering a Genode signal-context capability via the sigh method of the
Irq_session interface, the semaphore of the signal context is
bound to the interrupt semaphore. Genode signals and NOVA semaphores are
handled as described in Section [Asynchronous notifications on NOVA].

Upon the initial IRQ session's _ack_irq_ call, a NOVA semaphore-down operation
is issued within core on the interrupt semaphore, which implicitly unmasks the
interrupt at the CPU. When the interrupt occurs, the kernel masks the interrupt
at the CPU and performs the semaphore-up operation on the IRQ's semaphore.
Thereby, the chained semaphore, which is the previously registered Genode
signal context, is triggered and the interrupt is delivered as a
Genode signal. The interrupt gets acknowledged and unmasked by calling the
IRQ session's _ack_irq_ method.


Page-fault handling on NOVA
~~~~~~~~~~~~~~~~~~~~~~~~~~~

On NOVA, each EC has a pre-defined range of portal selectors.
For each type of exception, the range has a dedicated portal that is entered in
the event of an exception.
The page-fault portal of a Genode thread is defined at the creation
time of the thread and points to a pager EC per CPU within core. Hence,
for each CPU, a pager EC in core pages all Genode threads running on the same
CPU.


The operation of pager ECs
--------------------------

When an EC triggers a page fault, the faulting EC implicitly performs an
IDC call to its pager. The IDC message contains the fault information.
For resolving the page fault, core follows the procedure
described in [Page-fault handling]. If the lookup for a dataspace within
the faulter's region map succeeds, core establishes
a memory mapping into the EC's PD by invoking the asynchronous map operation
of the kernel and replies to the IDC message. In the case where the region lookup
within the thread's corresponding region map fails, the faulted thread
is retained in a blocked state via a kernel semaphore.
In the event that the fault is later resolved by a region-map client
as described in the paragraph "Region is empty" of Section
[Page-fault handling], the semaphore gets released, thus resuming the execution of
the faulted thread. The faulter will immediately trigger another fault at the
same address. This time, however, the region lookup succeeds.


Mapping database
----------------

NOVA tracks memory mappings in a data structure called _mapping database_
and has the notion of the delegation of memory mappings (rather than the
delegation of memory access). Memory access can be delegated only if the
originator of the delegation has a mapping. Core is the only exception because
it can establish mappings originating from the physical memory space.
Because mappings can be delegated transitively between PDs, the mapping
database is a tree where each node denotes the delegation of a mapping.
The tree is maintained in order to enable the kernel to rescind the authority.
When a mapping is revoked, the kernel implicitly cancels all transitive
mappings that originated from the revoked node.


Asynchronous notifications on NOVA
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To support asynchronous notifications as described in Section
[Asynchronous notifications], we extended the NOVA kernel semaphores to
support signalling via chained NOVA semaphores. This extension enables the
creation of kernel semaphores with a per-semaphore value, which can be bound to
another kernel semaphore. Each bound semaphore corresponds to a Genode signal
context. The per-semaphore value is used to distinguish different sources of
signals.

On this base platform, the blocking of the signal thread at the signal
source is realized by using a kernel semaphore shared by the PD session
and the PD client. All chained semaphores (signal contexts) are bound to this
semaphore. When first issuing a _wait-for-signal_ operation
at the signal source, the client requests a capability selector for the shared
semaphore _(repos/base-nova/include/signal_session/source_client.h)_. It then
performs a _down_ operation on this semaphore to block.

If a signal sender issues a submit operation on a Genode signal
capability, then a regular NOVA kernel semaphore-up syscall is used. If the
kernel detects that the used semaphore is chained to another semaphore, the up
operation is delegated to the one received during the initial _wait-for-signal_
operation of the signal receiving thread.

In contrast to other base platforms, Genode's signal API is supported by the
kernel so that the propagation of signals does not require any interaction with
core's PD service. However, the creation of signal contexts is arbitrated by
the PD service.


IOMMU support
~~~~~~~~~~~~~

As discussed in Section [Direct memory access (DMA) transactions], misbehaving
device drivers may exploit DMA transactions to circumvent their component
boundaries. When executing Genode on the NOVA microhypervisor, however,
bus-master DMA is subjected to the IOMMU.

The NOVA kernel
applies a subset of the (MMU) address space of a protection domain
to the (IOMMU) address space of a device. So the device's
address space can be managed in the same way as one normally manages the address
space of a PD. The only missing link is the assignment of device address
spaces to PDs. This link is provided by the dedicated system
call _assign_pci_ that takes a PD capability selector and a device identifier as
arguments. The PD capability selector represents the authorization over the
protection domain, which is going to be targeted by DMA transactions.
The device identifier is a virtual address where the extended PCI
configuration space of the device is mapped in the specified PD.
Only if a user-level device driver has access to the device's extended PCI
configuration space is it able to get the assignment in place.

To make NOVA's IOMMU support available to Genode,
the ACPI driver looks up the extended PCI configuration-space region for
all devices and reports it via a Genode ROM. The x86 platform
driver evaluates the reported ROM and uses the information to obtain the
extended PCI configuration space of each device, transparently for its
clients (device drivers). The platform driver uses a NOVA-specific
extension (_assign_pci_) to the PD-session interface to associate a PCI device
with a protection domain.

; XXX [image img/iommu_aware 63%]
;   NOVAs management of the IOMMU address spaces facilities the use of
;   driver-local virtual addresses as DMA addresses.

Even though these mechanisms combined should in theory
suffice to let drivers operate with the IOMMU enabled, in practice, the
situation is a bit more complicated. Because NOVA uses the same
virtual-to-physical mappings for the device as it uses for the process, the DMA
addresses the driver needs to supply to the device must be virtual addresses
rather than physical addresses. Consequently, to be able to make a device
driver usable on systems without IOMMU as well as on systems with IOMMU, the
driver needs to become IOMMU-aware and distinguish both cases. This is an
unfortunate consequence of the otherwise elegant mechanism provided by NOVA. To
relieve the device drivers from worrying about both cases, Genode decouples
the virtual address space of the device from the virtual address space of the
driver. The former address space is represented by a dedicated protection
domain called _device PD_ independent from the driver. Its sole purpose
is to hold mappings of DMA buffers that are accessible by the associated
device. By using one-to-one physical-to-virtual mappings for those buffers
within the device PD, each device PD contains a subset of the physical address
space. The platform driver performs the assignment of device PDs to PCI
devices. If a device driver intends to use DMA, it allocates a new DMA buffer
for a specific PCI device at the platform driver.
The platform driver responds to such a request by allocating a RAM dataspace at core,
attaching it to the device PD using the dataspace's physical address as virtual
address, and by handing out the dataspace capability to the client. If the driver
requests the physical address of the dataspace, the address returned will be a
valid virtual address in the associated device PD.
This design implies that a device driver must allocate DMA buffers at the
platform driver (specifying the PCI device the buffer is intended for) instead
of using core's PD service to allocate buffers anonymously.

; XXX [image img/iommu_agnostic 80%]
;   By modelling a device address space as a dedicated process (device PD),
;   the traditional way of programming DMA transactions can be maintained,
;   even with the IOMMU enabled.



Genode-specific modifications of the NOVA kernel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NOVA is not ready to be used as a Genode base platform as is. This section
compiles the modifications that were needed to meet the functional requirements of
the framework. All modifications are maintained at the following
repository:

:Genode's version of NOVA:

  [https://github.com/alex-ab/NOVA.git]

The repository contains a separate branch for each version of NOVA that has
been used by Genode. When preparing the NOVA port using the port description
at _repos/base-nova/ports/nova.port_, the NOVA branch that matches the used
Genode version is checked out automatically. The port description refers to
a specific commit ID. The commit history of each branch within the NOVA
repository corresponds to the history of the original NOVA kernel
followed by a series of Genode-specific commits. Each time NOVA is updated,
a new branch is created and all Genode-specific commits are rebased on top of
the history of the new NOVA version.
This way, the differences between the original NOVA kernel and the Genode
version remain clearly documented. The Genode-specific modifications solve the
following problems:

:Destruction of kernel objects:

  NOVA does not support the destruction of kernel objects. That is, PDs and
  ECs can be created but not destroyed. With Genode being a dynamic system,
  kernel-object destruction is a mandatory feature.

:Inter-processor IDC:

  On NOVA, only local ECs can receive IDC calls. Furthermore, each local EC
  is bound to a particular CPU (hence the name "local EC"). Consequently,
  synchronous inter-component communication via IDC calls is possible only
  between ECs that both reside on the same CPU but can never cross CPU
  boundaries. Unfortunately, IDC is the only mechanism for the delegation
  of capabilities. Consequently, authority cannot be delegated between
  subsystems that reside on different CPUs. For Genode, this scheme is
  too rigid.

  Therefore, the Genode version of NOVA introduces inter-CPU IDC calls.
  When calling
  an EC on another CPU, the kernel creates a temporary EC and SC on the
  target CPU as a representative of the caller. The calling EC is blocked.
  The temporary EC uses the same UTCB as the calling EC. Thereby, the
  original IDC message is effectively transferred from one CPU to the other.
  The temporary EC then performs a local IDC to the destination EC using
  NOVA's existing IDC mechanism. Once the temporary EC receives the reply
  (with the reply message contained in the caller's UTCB), the kernel
  destroys the temporary EC and SC and unblocks the caller EC.

:Support for priority-inheriting spinlocks:

  Genode's lock mechanism relies on a yielding spinlock for protecting the
  lock meta data. On most base platforms, there exists the invariant that
  all threads of one component share the same CPU priority. So priority
  inversion within a component cannot occur. NOVA breaks this invariant
  because the scheduling parameters (SC) are passed along IDC call chains.
  Consequently, when a client calls a server, the SCs of both client
  and server reside within the server. These SCs may have different
  priorities. The use of a naive spinlock for synchronization will produce
  priority inversion problems. The kernel has been extended with the
  mechanisms needed to support the implementation of
  priority-inheriting spinlocks in userland.

:Combination of capability delegation and translation:

  As described in
  Section [Capability delegation through capability invocation],
  there are two cases when a capability is specified as an RPC argument.
  The callee may already have a capability referring to the specified
  object identity. In this case, the callee expects to receive the corresponding
  local name of the object identity. In the other case, when the callee
  does not yet have a capability for the object identity, it obtains a new
  local name that refers to the delegated capability.

  NOVA does not support this mechanism per se.
  When specifying a capability selector as map item for an IDC call,
  the caller has to specify whether a new mapping should be created or
  the translation of the local names should be performed by the kernel.
  However, in the general case, this question is not decidable by the caller.
  Hence, NOVA had to be changed to take the decision depending on the
  existence of a valid translation for the specified capability selector.

:Support for deferred page-fault resolution:

  With the original version of NOVA, the maximum number of threads is limited
  by core's stack area:
  NOVA's page-fault handling protocol works completely synchronously. When a
  page fault occurs, the faulting EC enters its page-fault portal and thereby
  activates the corresponding pager EC in core. If the pager's lookup for a
  matching dataspace within the faulter's region map succeeds, the page fault
  is resolved by delegating a memory mapping as the reply to the page-fault
  IDC call. However, if a page fault occurs on a managed dataspace, the pager
  cannot resolve it immediately. The resolution must be delayed until the
  region-map fault handler (outside of core) responds to the fault signal. In
  order to enable core to serve page faults of other threads in the meantime,
  each thread has its dedicated pager EC in core.

  Each pager EC, in turn, consumes a slot in the stack area within core. Since
  core's stack area is limited, the maximum number of ECs within core is
  limited too. Because one core EC is needed as pager for each thread outside
  of core, the available stacks within core become a limited resource
  shared by all CPU-session clients. Because each Genode component is a client
  of core's CPU service, this bounded resource is effectively shared among all
  components. Consequently, the allocation of threads on NOVA's version of
  core represents a possible covert storage channel.

  To avoid the downsides described above, we extended the NOVA IPC reply system
  call to specify an optional semaphore capability selector. The NOVA kernel
  validates the capability selector and blocks the faulting thread in the
  semaphore. The faulted thread remains blocked even after the pager has
  replied to the fault message. But the pager immediately becomes available for
  other page-fault requests. With this change, it suffices to maintain only
  one pager thread per CPU for all client threads.

  The benefits are manifold. First, the base-nova implementation converges
  more closely with the other Genode base platforms. Second, core can no
  longer run out of threads because the number of threads in core is fixed
  for a given setup. Third, NOVA's helping mechanism can be leveraged
  for concurrently faulting threads.

:Remote revocation of memory mappings:
  In the original version of NOVA, roottask must retain mappings to all memory
  used throughout the system. In order to be able to delegate a mapping to
  another PD as response of a page fault, it must possess a local mapping
  of the physical page.
  Otherwise, it would not be able to revoke the mapping later on
  because the kernel expects roottask's mapping node as a proof of the
  authorization for the revocation of the mapping.
  Consequently, even though roottask never touches memory handed out to other
  components, it needs to have memory mappings with full access rights
  installed within its virtual address space.

  To relieve Genode's roottask (core) from the need to keep local mappings
  for all memory handed out to other components and thereby let core
  benefit from a sparsely populated address space as described in Section
  [Sparsely populated core address space] for base-hw, we changed the kernel's
  revoke operation to take a PD selector and a virtual address within the
  targeted PD as argument. By presenting the PD selector as a token of
  authorization over the entire PD, we do no longer need core-locally
  installed mappings as the proof of authorization. Hence, memory mappings can
  always be installed directly from the physical address space to the target
  PD.

:Support for write-combined access to memory-mapped I/O resources:
  The original version of NOVA is not able to benefit from write combining
  because the kernel interface does not allow the userland to specify
  cacheability attributes for memory mappings. To achieve good throughput to
  the framebuffer, write combining is crucial. Hence, we extended the kernel
  interface to allow the userland to propagate cacheability attributes to the
  page-table entries of memory mappings and set up the x86 page attribute
  table (PAT) with a configuration for write combining.

:Support for the virtualization of 64-bit guest operating systems:
  The original version of NOVA supports 32-bit guest operating systems only.
  We enhanced the kernel to also support 64-bit guests.

:Resource quotas for kernel resources:
  The NOVA kernel lacks the ability to adapt its kernel memory pool to the
  behavior of the userland. The kernel memory pool has a fixed size, which
  cannot be changed at runtime. Even though we have not removed this
  principal limitation, we extended the kernel with the ability to
  subject kernel-memory allocations to a user-level policy at the granularity
  of PDs. Each kernel operation that consumes kernel memory is accounted
  to a PD, and each PD has a limited quota of kernel memory. This
  measure prevents arbitrary userland programs from bringing down the entire
  system by exhausting the kernel memory. The reach of damage is limited to
  the respective PD.

:Asynchronous notification mechanism:
  We extended the NOVA kernel semaphores to support signalling via chained
  NOVA semaphores. This extension enables the creation of kernel semaphores
  with a per-semaphore value, which can be bound to another kernel semaphore.
  Each bound semaphore corresponds to a Genode signal context. The
  per-semaphore value is used to distinguish different sources of signals. Now,
  a signal sender issues a submit operation on a Genode signal capability via a
  regular NOVA semaphore-up syscall. If the kernel detects that the used
  semaphore is chained to another semaphore, the up operation is delegated to
  the chained one. If a thread is blocked, it gets woken up directly and the
  per-semaphore value of the bound semaphore gets delivered. In case no thread
  is currently blocked, the signal is stored and delivered as soon as a thread
  issues the next semaphore-down operation.

  Chaining semaphores is an operation that is limited to a single level, which
  avoids attacks targeting endless loops in the kernel. The creation of such
  signals can be performed only if the issuer has a NOVA PD capability with
  the semaphore-create permission set. On Genode, this effectively reserves the
  operation to core. Furthermore, our solution preserves the invariant of the
  original NOVA kernel that a thread may be blocked in only one semaphore at
  a time.

:Interrupt delivery:
  We applied the same principle of the asynchronous-notification extension
  to the delivery of interrupts by the NOVA kernel. Interrupts are delivered
  as ordinary Genode signals, which alleviates the need for one thread per
  interrupt as required by the original NOVA kernel. An interrupt is
  delivered directly to the address space of the driver in the case of a
  message-signaled interrupt (MSI), or to the x86 platform driver in the
  case of a shared interrupt.

Known limitations of NOVA
~~~~~~~~~~~~~~~~~~~~~~~~~

This section summarizes the known limitations of NOVA and the NOVA version of
core.

:Fixed amount of kernel memory:
  NOVA allocates kernel objects out of a memory pool of a fixed size. The pool
  is dimensioned in the kernel's linker script
  _nova/src/hypervisor.ld_ (at the symbol '_mempool_f').

:Bounded number of object capabilities within core:
  For each capability created via core's PD service,
  core allocates the corresponding NOVA portal or NOVA semaphore and maintains
  the capability selector
  during the lifetime of the associated object identity. Each allocation of
  a capability via core's PD service consumes one entry in core's capability
  space. Because the space is bounded, clients of the service could misuse
  core's capability space as covert storage channel.

; XXX further possible topics
; * Locking
; * Introduce the threads running within core
;   (why to use separate a separate thread for signals?)
; * Shared interrupts

